RELAXED, MORE OPTIMUM TRAINING FOR
MODEMS AND THE LIKE

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to filter adaptation, for example in 
communications such as wireline communications. 

2. State of the Art

Broadband communications solutions, such as HDSL2/G.SHDSL
(High-Speed Digital Subscriber Line), are increasingly in demand. The ability to
achieve high data rates (e.g., 1.5 Mbps and above) between customer premises and
the telephone system central office over existing (unconditioned) telephone lines
requires exacting performance. Various components of a high-speed modem that
contribute to this performance require training, e.g., a timing section (PLL, or
phase-locked loop), an adaptive equalizer, and an adaptive echo canceller. Typically,
these components are all trained in serial fashion, one after another, during an
initial training sequence in which known data is transmitted between one end of the
line and the other.

Equalization is especially critical for HDSL2/G.SHDSL modems, which
are required to operate over various line lengths and wire models, over wirelines
with and without bridge taps, and under extremely divergent cross-talk scenarios.
In general, intersymbol interference (ISI), which equalization aims to eliminate, is
the limiting factor in xDSL communications. Hence, good equalization, characterized
by the ability to accurately compute the optimal channel equalizer coefficients at the
start-up phase of the modem and adaptively update those coefficients to
accommodate any change in the level of cross-talk, is essential to any
HDSL2/G.SHDSL system.

Known training methods for high-speed modems suffer from various
disadvantages. Existing commercial products invariably use a Least Mean Squares
(LMS) training algorithm, which is assumed to converge to an optimal training
solution. The LMS algorithm is well-known and has generally been found to be
stable and easy to implement. Conventional wisdom holds that the steady-state
performance of LMS cannot be improved upon. Despite the widespread use of LMS
and its attendant advantages, the adequacy of performance of LMS is being tested
by the performance requirements of high-speed modems.

Nor are the alternatives to LMS particularly appealing. Other proposed
algorithms have chiefly been of academic interest. The Recursive Least Squares
(RLS) algorithm, for example, requires a far shorter training time than LMS
(potentially one tenth the training time needed for LMS), but RLS entails
far greater computational complexity. If N is the total number of taps in an
adaptive filter, then the complexity of RLS is roughly N^2, as compared to 2N for
LMS. Also, RLS is less familiar and less tractable, suffering from stability
problems.

An improved RLS algorithm ("fast RLS") considerably reduces the
computational complexity of RLS, from N^2 to 28N. The original fast RLS algorithm
is described in Falconer and Ljung, Application of Fast Kalman Estimation to
Adaptive Equalization, IEEE Transactions on Communications, Vol. COM-26, No.
10, October 1978, incorporated herein by reference. The fast RLS algorithm,
however, requires that training be performed on contiguous data symbols. If
training is performed "on-line," then a high-performance processor is required to
perform training computations at a rate sufficient to keep pace with the data rate,
e.g., 1.5 Mbps or greater. Although the computational demand (demand for MIPS)
"spikes up" during training, once training is completed, computational demands are
modest. If training is performed "off-line" using stored data samples, then the
processor need not keep up with the data rate, reducing peak performance
requirements. However, a potentially long sequence of training data must be stored
to satisfy the requirement of the algorithm for contiguous data, requiring a sizable
memory. Again, the memory requirement, like training itself, is transient. Once
training has been completed, the need for such a large memory is removed.

Apart from training, because communications channels vary over time, 
continuous or periodic filter adaptation is required. In the case of rapidly varying 
channel conditions, as in wireless communications and especially mobile wireless 
communications, and in the case of especially long filters relative to adaptation 
processing power, the use of RLS is indicated. In wireline communications, these 
conditions are typically not present. Even in the demanding case of 
HDSL2/G.SHDSL, filter lengths are moderate and channel variation can be 
considered to be slow. To applicant's knowledge, all wireline modems use LMS 
"on-line" for non-training filter adaptation. 

Although the error criteria used by the LMS and RLS algorithms differ, the 
prevalent mathematical analysis of these algorithms suggests that the algorithms 
converge to the same solution, albeit at different rates. LMS uses mean squared 
error, a statistical average, as the error criterion. RLS eliminates such statistical 
averaging. Instead, RLS uses a deterministic approach based on squared error (note 
the absence of the word mean) as the error criterion. In effect, instead of the 
statistical averaging of LMS, RLS substitutes temporal averaging, with the result 
that the filter depends on the number of samples used in the computation. Although 
the prevalent mathematical analysis predicts equivalent performance for the two 
algorithms, the mathematical analysis for LMS is approximate only. Although a 
mathematically exact analysis of LMS has recently been advanced, the 
overwhelming complexity of that analysis defies any meaningful insight into the 
behavior of the algorithm and requires numeric solution. 
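
For reference, the two error criteria contrasted above can be written in their conventional textbook forms. This is a sketch in standard notation that is not taken from the figures: e(i) is assumed to be the filter error at sample i, E[.] denotes statistical expectation, and lambda is the forgetting factor discussed later in this description.

```latex
\[
J_{\mathrm{LMS}} = E\!\left[\,|e(n)|^{2}\,\right],
\qquad
J_{\mathrm{RLS}}(n) = \sum_{i=1}^{n} \lambda^{\,n-i}\,|e(i)|^{2},
\quad 0 < \lambda \le 1 .
\]
```

The second cost is a weighted sum over the n samples actually observed, which is why the resulting filter depends on the number of samples used in the computation.
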

There remains a need, particularly in high-speed wireline communications,
for a filter adaptation solution that overcomes the foregoing disadvantages, i.e., that
achieves greater optimality without requiring undue computational resources.



SUMMARY OF THE INVENTION

The present invention, generally speaking, uses adaptation based on a least
squares error criterion to improve performance of a wireline modem or the like. In
accordance with one aspect of the invention, a high-speed, broadband, wireline
modem includes an adaptive equalizer having both a training mode and a
decision-directed, non-training mode, the adaptive equalizer including a memory for
storing received signal samples; a forward path coupled to receive the signal
samples, the forward path including a forward filter and a decision element; a
feedback path coupled between an output of the decision element and an input of the
decision element, the feedback path including a feedback filter; wherein the
combined length of the forward filter and the feedback filter is moderate relative to
adaptation processing power; and an adaptation circuit or processor for adapting the
forward filter and the feedback filter based on a least squares error criterion, as
distinguished from a least mean squares error criterion. A lower noise floor is
thereby achieved. The resulting improved noise margin may be used to "buy"
greater line length, better quality of service (QoS), higher speed using denser
symbol constellations, greater robustness in the presence of interference or noise,
lower-power operation (improving interference conditions) or any combination of
the foregoing.

In accordance with another aspect of the invention, an adaptation
algorithm based on the least squares error criterion is provided for use during
training of a high-speed, broadband modem. The algorithm converges to a more
optimal solution than LMS. Furthermore, the algorithm achieves a high level of
robustness with decreased computational complexity as compared to known
algorithms. The algorithm is well-suited for fixed-point implementation.
Significantly, unlike known algorithms, the algorithm allows for reinitialization and
the use of non-contiguous data. This feature allows a wide spectrum of system
initialization strategies to be followed, including strategies in which training of
multiple subsystems is interleaved to achieve superior training of the individual
subsystems and hence the overall system, strategies tailored to meet a specified
computational budget, etc.

PATENT
ATTORNEY'S DOCKET NO. 032478-005

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention may be further understood from the following
description in conjunction with the appended drawings. In the drawings:

Figure 1 is a block diagram showing portions of a wireline communications
transceiver with which the present invention may be used;

Figure 2 is a block diagram of a decision feedback equalizer of Figure 1;

Figures 3A-3D are graphs illustrating superior steady-state performance
achieved using the RLC-fast algorithm;

Figure 4 is a chart summarizing the original fast RLS algorithm and the
computational complexity of the algorithm;

Figure 5 is a chart summarizing the RLC-fast algorithm and the
computational complexity of the algorithm;

Figure 6 is a diagram illustrating conventional training using a single
contiguous block of data;

Figure 7 is a diagram illustrating the inability of conventional adaptation
techniques to operate on discontiguous blocks of data;

Figure 8 is a diagram illustrating the ability, afforded in accordance with one
aspect of the present invention, to operate on discontiguous blocks of data;

Figure 9 is a diagram illustrating further particulars of the reinitialization
of Figure 8; and

Figure 10 is a flowchart illustrating cooperative, interdependent training of
multiple subsystems in accordance with one aspect of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring to Figure 1, a block diagram is shown illustrating portions of a
wireline communications transceiver with which the present invention may be used.
The wireline communications transceiver includes a control section, a transmit
section, a receive section, and a hybrid section. Within the transceiver, particularly
the receive section, may be various subsystems that require training by the control
section, for example, a PLL (which may be of digital implementation), an echo
canceller and an adaptive equalizer. The training of these subsystems is an
interdependent process. For example, some initial training of the PLL may be
required prior to any other training. This initial training, however, may not achieve
as good results as may be obtained following some training of one or more of the
other subsystems, i.e., the echo canceller and adaptive equalizer. As described
hereinafter, the present training methods allow for coordinated, interdependent
training of multiple sub-systems, offering the potential of substantially improving
overall system performance.

Since ISI, which equalization aims to eliminate, is typically the limiting
factor in xDSL communications, the focus of the following description will be
equalizer training. The same principles, however, may be applied to the training of
various different communications subsystems.

Referring to Figure 2, a block diagram is shown of an adaptive
decision-feedback equalizer (DFE) suitable for use in the wireline communications
transceiver of Figure 1. An input signal from a communications line (e.g., an
HDSL2/G.SHDSL line) is 2x oversampled. The communications line forms the
channel for which equalization is to be performed. The samples are input to an
adaptive feedforward filter. In conjunction with the filter, a decimation operation is
performed. The resulting data stream is applied to a decision element, or "slicer,"
which produces an output of the equalizer. The output is applied to an adaptive
feedback filter, an output of which is summed into the input to the slicer. The DFE
structure per se is known. Data decisions are filtered by the feedback filter to
eliminate ISI arising from previous pulses. Because the feedback filter compensates
for this "past" ISI, the feedforward filter need only compensate for "future" ISI.
The equalizer of Figure 2 differs from conventional equalizers in that the filter
adaptation is performed using a variant of RLS, Reinitializable Low Complexity fast
least squares (RLC-fast), described hereinafter.
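
The signal flow of the DFE just described can be sketched in C as follows. This is an illustrative floating-point sketch only, not the implementation of Figure 2: the filter sizes, the two-level slicer, and the names `slicer` and `dfe_step` are assumptions introduced here, and adaptation of the coefficients is omitted.

```c
#define FFE_LEN 4   /* feedforward taps (hypothetical size) */
#define FBE_LEN 2   /* feedback taps (hypothetical size) */

/* Two-level slicer: a stand-in for the real symbol constellation. */
static double slicer(double v) { return v >= 0.0 ? 1.0 : -1.0; }

/*
 * One symbol-rate DFE step.
 *   x        : the FFE_LEN most recent oversampled input samples
 *   past     : the FBE_LEN most recent decisions (past[0] is the newest)
 *   ffe, fbe : current filter coefficients
 * Returns the new decision and shifts it into past[].
 */
static double dfe_step(const double *x, double *past,
                       const double *ffe, const double *fbe)
{
    double v = 0.0;
    int i;

    for (i = 0; i < FFE_LEN; i++)      /* forward path ("future" ISI) */
        v += ffe[i] * x[i];
    for (i = 0; i < FBE_LEN; i++)      /* feedback path ("past" ISI) */
        v += fbe[i] * past[i];

    double d = slicer(v);              /* decision element */

    for (i = FBE_LEN - 1; i > 0; i--)  /* shift the decision history */
        past[i] = past[i - 1];
    past[0] = d;
    return d;
}
```

With a feedback tap of -0.5 and a previous decision of +1, an input sample of +0.1 is sliced to -1 rather than +1, because the feedback filter removes the +0.5 contributed by the previous pulse before the slicer sees the sample.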

An important, even startling, discovery of the present inventors is that
RLS-type algorithms, apart from converging faster, converge to a lower noise floor
than the LMS algorithm. That is, better equalization can be performed using the
RLS-type algorithms than with LMS. This result is illustrated in Figures 3A-3D. Only
in the exacting environment of high-speed, wide-band wireline modems such as
HDSL2/G.SHDSL does this important difference come to the fore. In fact,
experiments have shown that in this environment, even if an adaptive filter is set to
a near-optimal solution obtained using an RLS-type algorithm, if the LMS algorithm
is then used, the filter settings will actually diverge from the near-optimal solution.
A great incentive therefore exists to use an RLS-type algorithm instead of the
prevalent LMS algorithm. Impediments to the use of RLS-type algorithms in this
environment include computational complexity and instability.

Although the computational complexity of the fast RLS algorithm is greatly
reduced, it remains significant. The computational complexity of adaptation is
measured in terms of the number of multiplications and/or divisions required per
filter coefficient times the total number of filter coefficients N for the structure.
Although the present invention may be used with equalizers of other structures and
in other applications of adaptive filters, the invention will be described with respect
to the exemplary embodiment of Figure 2.

Whereas the original fast RLS algorithm requires 28N multiplications and a
matrix inversion, the present "RLC-fast" algorithm requires 22N multiplications
and 2 divisions. This improvement in
computational efficiency is achieved by efficiently rewriting the original algorithm.
Note that there are algorithms with computational complexity as low as 17N;
however, they are very susceptible to error accumulation, and are hard to stabilize
without the use of additional correction terms. In the case of a fixed-point equalizer
implementation, stability is crucial for overall system reliability. The computational
complexity of RLC-fast is reduced without significantly degrading the stability of
the algorithm.
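
As a worked illustration of these counts, take N = 100 taps, the "moderate-size" problem mentioned later in this description. The figures below are simple arithmetic on the complexity expressions quoted in the text (multiplications per update), not measurements.

```latex
\[
\begin{array}{lll}
\text{RLS}      & N^{2} & = 10\,000 \\
\text{fast RLS} & 28N   & = 2\,800 \\
\text{RLC-fast} & 22N   & = 2\,200 \\
\text{LMS}      & 2N    & = 200
\end{array}
\]
```

On these numbers RLC-fast saves 6N = 600 multiplications per update relative to the original fast RLS algorithm (about 21%), while remaining 11 times more expensive than LMS.
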

Referring to Figure 4, a chart summarizing the original fast RLS algorithm
is shown.

Referring to Figure 5, a corresponding chart summarizing the RLC-fast
algorithm is shown. The notation of Figure 5 differs slightly from that of
Figure 4, as set forth in the following table:



Table 1

Quantity                           Prior Art Algorithm    New Algorithm
Backward predictor coefficients
performed

Derivation of the RLC-fast algorithm from the original algorithm and the
computational advantages of the RLC-fast algorithm are described in detail in
Appendix B.

A fixed-point implementation of the RLC-fast algorithm is desirable to
reduce computational load and hence increase the speed of the algorithm, as well as
to avoid the cost and increased power consumption of a floating-point processor.
Because of the underlying stability issues of RLS-type algorithms, such a fixed-point
implementation must be carefully considered. The binary point cannot be assumed
to lie immediately after the sign bit (i.e., all numbers within [-1, 1)), because some
of the internal variables may take values larger than 1 and would then saturate.

Key elements for successful implementation of the RLC-fast algorithm
include: (1) appropriate scaling of the input variables; (2) the position of the binary
point for internal variables; (3) efficient internal scaling of the variables after
multiplication and division to reduce loss of precision; (4) complete analysis of the
dynamic range of the various internal variables; and (5) judicious choice of delta (δ)
and lambda (λ) for convergence speed and stability.

A currently preferred implementation assumes 32-bit precision for all the
variables, with all the numbers being of signed integer form. The integer numbers
are given a fixed-point (fractional) interpretation in which the leading bit is the
sign bit, followed by a 5-bit integer part and a 26-bit fractional part. Multiplication
and division are performed assuming the foregoing interpretation of the integer
variables. Two divisions occur per update. Both are computed as 1/(1 + x)
instead of 1/x to reduce the loss in precision.

A more detailed description of the RLC-fast algorithm is given in Appendix
A (implemented in fixed-point arithmetic for the TI-C6x DSP).
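
The number format just described (sign bit, 5-bit integer part, 26-bit fractional part) can be illustrated with a small portable C sketch. Appendix A calls helpers such as mpy_n and div_wor whose definitions are not reproduced in this document, so the functions below (`q_mul`, `q_recip`, `q_inv_one_plus`) are hypothetical stand-ins written for illustration; they assume a 64-bit intermediate and an arithmetic right shift for negative values, and they do not saturate.

```c
#include <stdint.h>

#define FRAC_BITS 26
#define FIXED_ONE ((int32_t)1 << FRAC_BITS)   /* 1.0 in the 1.5.26 format */

/* Multiply two 1.5.26 numbers: widen to 64 bits, then drop the extra
 * 26 fractional bits. Assumes >> on a negative value is arithmetic. */
static int32_t q_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Reciprocal 1/d of a 1.5.26 number. The caller must ensure d != 0 and
 * that the result fits in the format (|1/d| < 32). */
static int32_t q_recip(int32_t d)
{
    return (int32_t)(((int64_t)FIXED_ONE << FRAC_BITS) / d);
}

/* The form used for both divisions in the update: 1/(1 + x). */
static int32_t q_inv_one_plus(int32_t x)
{
    return q_recip(FIXED_ONE + x);
}
```

For example, `q_mul(3 * (FIXED_ONE / 2), 2 * FIXED_ONE)` represents 1.5 x 2 and yields `3 * FIXED_ONE`, a value that a pure [-1, 1) format could not hold, and `q_inv_one_plus(FIXED_ONE)` yields `FIXED_ONE / 2`.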

Due to the high data rate of the HDSL2/G.SHDSL system, for moderate-size 
problems (N about 100), the RLC-fast algorithm, even with its reduced complexity, 
poses a high computational burden on a typical processor (say, an X MIPS 
processor). In many modems, RLC-fast will be executed only once at the start-up
phase of the modem and will not be used in the steady-state, which is the normal 
operating state for the modem. 

Hence, although it may be feasible to deploy a high-speed, power-hungry 
DSP for on-line execution of RLC-fast, such a measure adversely impacts power 
consumption and may not be cost effective. As a result, off-line implementation of 
RLC-fast will often be the preferred alternative. 

However, off-line implementation itself raises problems. The RLS-type 
algorithm requires a certain data length in order to converge to a near optimal value. 
The convergence time is a function of the so-called forgetting factor. An aggressive 
choice of the forgetting factor can be used to reduce the required data length but at 
the cost of stability. 
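
A common rule of thumb, stated here as background and not taken from this description, relates the forgetting factor to the number of past samples that effectively contribute to the least squares solution. It assumes the exponentially weighted cost in which a sample that is k updates old is weighted by lambda to the power k.

```latex
\[
\sum_{k=0}^{\infty} \lambda^{k} = \frac{1}{1-\lambda},
\qquad \text{e.g. } \lambda = 0.999 \;\Rightarrow\; \frac{1}{1-\lambda} = 1000 \text{ samples}.
\]
```

A forgetting factor closer to 1 therefore averages over more data, while a smaller (more aggressive) value shortens the effective memory.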

A reasonable choice for the forgetting factor may require a long data length
(say, 100N) for convergence. This in turn implies a large storage requirement even
for a moderate-size problem. Once again, if this memory is only used during the
start-up phase, a straightforward implementation wastes large amounts of silicon
and results in an inefficient design.

The original fast RLS algorithm offers no solution to the foregoing problem. 
Referring more particularly to Figure 6, the original fast RLS algorithm requires the 
input data stream to be contiguous. If there is a break in the input data stream, the 
only way to use the new data in the original approach is to restart the algorithm all 
over again as illustrated in Figure 7. Of course, the algorithm can be trained with 
smaller size blocks of data, but only at the cost of reduced performance. That is, the 
advantage of Figure 3 would be sacrificed. 

To circumvent the requirement of a contiguous data stream, RLC-fast uses a
re-initialization scheme that allows the use of non-contiguous data blocks without
restarting the algorithm. At start-up, the algorithm is initialized in the usual way.
However, the algorithm can be stopped at any time and started at a later time with a
new initialization. This manner of operation is illustrated in Figure 8. No
difference in performance is observed if individual data blocks are not too small
(say, no smaller than 10N). Hence, storage requirements may be reduced by an
order of magnitude (e.g., 10N instead of 100N).

The particulars of re-initialization are illustrated in Figure 9. Instead of
setting the intermediate variables to zero or a scaled identity matrix, the previous
values are used for all variables except X_fast. The variables A_fast, F_fast, K_fast,
b_n, D_fast and C_fast are all stored for this purpose.
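
The difference between a cold start and a re-initialization can be sketched in C as follows. The structure layout, the small dimensions, and the function names `rlc_init` and `rlc_reinit` are hypothetical; the variable names follow Appendix A. Clearing X_fast to zero on re-initialization is an assumption made for this sketch (the text states only that X_fast is the one variable whose previous value is not reused; Figure 9 gives the particulars).

```c
#include <stdint.h>
#include <string.h>

#define N 8   /* total number of taps (hypothetical size) */
#define M 3   /* block dimension used throughout Appendix A */

typedef struct {
    int32_t A_fast[N * M];
    int32_t D_fast[N * M];
    int32_t F_fast[M * M];
    int32_t K_fast[N];
    int32_t b_n[N];
    int32_t C_fast[N];
    int32_t X_fast[N];
} rlc_state;

/* Cold start: everything zero except F_fast = delta1 * I. */
static void rlc_init(rlc_state *s, int32_t delta1)
{
    int i;
    memset(s, 0, sizeof *s);
    for (i = 0; i < M; i++)
        s->F_fast[i * M + i] = delta1;
}

/* Re-initialization before a new, non-contiguous data block: keep
 * A_fast, D_fast, F_fast, K_fast, b_n and C_fast, and reset only the
 * data vector X_fast (assumed here to be cleared to zero). */
static void rlc_reinit(rlc_state *s)
{
    memset(s->X_fast, 0, sizeof s->X_fast);
}
```

Because the state survives between blocks, only one short block of samples needs to be buffered at a time.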

The foregoing re-initialization capability allows for a store/process mode of
operation. More particularly, even with the reduced complexity of RLC-fast, the
amount of computation required for real-time processing of moderate-size problems
can be prohibitive for most DSPs due to the high data rate of the system. To
alleviate this problem, a store/process mode of operation is followed in which,
during the first half of a cycle, a small block of data (e.g., size 10N) is stored, and
during the second half of the cycle, the data is processed to update the filter
coefficients. Because the data is stored, each update need not be finished within the
sample time T, as it would in real-time operation. Instead, the computation can be
distributed over multiple sample periods.

One approach is to partition the computation of the update for each data
sample into segments small enough that an individual segment can be finished in
one sample time. The smaller the segments, the less processing is required each
sample period, but the longer the total time to finish the update. Hence,
store/process operation, along with partitioning of the update computation, provides
a flexible mechanism that allows for a trade-off between processing load and total
time to process a data block. Without the capability of re-initialization, this
flexibility is not obtainable.
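
The trade-off can be made concrete with a small calculation, sketched here in C. The model is a deliberate simplification and an assumption of this sketch: an update is taken to cost a fixed number of multiplications split into equal segments, one segment is executed per sample period, and divisions and control overhead are ignored. The function names are hypothetical.

```c
/* Multiplications that must fit in one sample period when a single
 * update is split into `segments` equal parts (rounded up). */
static long load_per_sample(long mults_per_update, long segments)
{
    return (mults_per_update + segments - 1) / segments;
}

/* Sample periods needed to process a stored block when each update is
 * spread over `segments` sample periods. */
static long periods_per_block(long block_len, long segments)
{
    return block_len * segments;
}
```

For N = 100, an update of 22N = 2200 multiplications split into 4 segments needs 550 multiplications per sample period, and a stored block of 10N = 1000 samples then takes 4000 sample periods to process; doubling the number of segments halves the load and doubles the time.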

The same flexibility may be extended from the adaptive equalizer or other
isolated sub-system to the system as a whole, in such a way as to achieve not only
great flexibility but also improved performance. In reality, the performance of each
sub-system depends on the performance of other sub-systems and should
not be viewed in isolation. Referring to Figure 10, for example, the performance of
the clock recovery circuit of the PLL block is influenced by the performance of the
echo canceller and vice versa. The same is true for the performance of the echo
canceller and the equalizer. The better the echo cancellation is, the better the
equalizer performance will be. As a consequence, there will be fewer errors in the
decisions. These more accurate decisions can be used to improve the echo
cancellation performance. This improved echo cancellation can in turn be utilized
to improve the jitter performance of the PLL.

It will be appreciated by those of ordinary skill in the art that the invention 
can be embodied in other specific forms without departing from the spirit or 
essential character thereof. The presently disclosed embodiments are therefore 
considered in all respects to be illustrative and not restrictive. The scope of the 
invention is indicated by the appended claims rather than the foregoing description, 
and all changes which come within the meaning and range of equivalents thereof are 
intended to be embraced therein. 
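
APPENDIX A

A Detailed Description of the RLC-fast Algorithm

Step 1 Initialization:

\[
\begin{aligned}
A_{fast} &= 0, && (N \times 3) \\
D_{fast} &= 0, && (N \times 3) \\
F_{fast} &= \delta_1 I, && (3 \times 3) \\
K_{fast} &= 0, && (N \times 1) \\
X_{fast} &= 0, && (N \times 1) \\
C_{fast} &= 0, && (N \times 1) \\
\mathrm{decision} &= 0, && (1 \times 1)
\end{aligned}
\]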

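Step 2 Data Acquisition:

\[
\begin{aligned}
P3 &= [\,X_{fast}(\mathrm{FFE\_LEN}-2)\;\; X_{fast}(\mathrm{FFE\_LEN}-1)\;\; X_{fast}(N-1)\,], \\
\mathrm{eps2} &= [\,y(2);\; y(1);\; \mathrm{decision}\,], \\
\mathrm{decision} &= \mathrm{tx\_signal}.
\end{aligned}
\]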

Step 3 Update Computation:

\[ e = \mathrm{eps2} + A_{fast}^{T} X_{fast}. \]

    e[0]=eps2[0];
    e[1]=eps2[1];
    e[2]=eps2[2];
    for(ind=0;ind<N;ind++)
    {
        e[0]+=mpy_n(A_fast[ind],X_fast[ind]);
        e[1]+=mpy_n(A_fast[ind+N],X_fast[ind]);
        e[2]+=mpy_n(A_fast[ind+N_2],X_fast[ind]);
    }

\[ A_{fast} = A_{fast} - K_{fast}\, e^{T}. \]

    for(ind=0;ind<N;ind++)
    {
        ind2=ind+N;
        ind3=ind+N_2;
        A_fast[ind]-=mpy_n(K_fast[ind],e[0]);
        A_fast[ind2]-=mpy_n(K_fast[ind],e[1]);
        A_fast[ind3]-=mpy_n(K_fast[ind],e[2]);
    }

Equivalently,

\[ e_p = e\,(1 - K_{fast}^{T} X_{fast}). \]

    temp_fast=inp(K_fast,X_fast,N);
    temp_fast=FIXED_ONE-temp_fast;
    ep[0]=mpy_n(temp_fast,e[0]);
    ep[1]=mpy_n(temp_fast,e[1]);
    ep[2]=mpy_n(temp_fast,e[2]);
\[ F_{fast} = \lambda_1 F_{fast}. \]

    mat_scal(lambda,F_fast,MM,F_fast);

\[ F_{fast} = F_{fast} - \frac{F_{fast}\, e_p\, e^{T} F_{fast}}{1 + e^{T} F_{fast}\, e_p}. \]

We use the following equivalent computation:

\[
\begin{aligned}
G &= \frac{F_{fast}\, e_p}{1 + e^{T} F_{fast}\, e_p}, \\
F_{fast} &= F_{fast} - G\, e^{T} F_{fast}.
\end{aligned}
\]

    mat_cvec(F_fast,ep,M,temp_vecM);
    temp_fast=inp(e,temp_vecM,M);
    temp_fast=div_wor(temp_fast+FIXED_ONE);
    mat_scal(temp_fast,temp_vecM,M,temp_vecM);
    mat_rvec(F_fast,e,M,temp_vecM1);
    outp(temp_vecM,temp_vecM1,M,temp_matM);
    for(ind=0;ind<MM;ind++)
    {
        F_fast[ind]=F_fast[ind]-temp_matM[ind];
    }

\[ \mathrm{temp\_vec} = F_{fast}\, e_p. \]

    mat_cvec(F_fast,ep,M,temp_vecM);

\[ b_n = K_{fast} + A_{fast}\, \mathrm{temp\_vec}. \]

    /* b_n has N entries (bn[N-1] is read below), so the loop runs to N */
    for(ind=0;ind<N;ind++)
    {
        bn[ind]=K_fast[ind];
        bn[ind]+=mpy_n(A_fast[ind],temp_vecM[0]);
        bn[ind]+=mpy_n(A_fast[ind+N],temp_vecM[1]);
        bn[ind]+=mpy_n(A_fast[ind+N_2],temp_vecM[2]);
    }

\[ m = [\,\mathrm{temp\_vec}(1{:}2);\; b_n(1{:}\mathrm{FFE\_LEN}-2);\; \mathrm{temp\_vec}(3);\; b_n(\mathrm{FFE\_LEN}+1{:}N-1)\,]. \]

    m_vec[0]=temp_vecM[0];
    m_vec[1]=temp_vecM[1];
    for(ind=2;ind<FFE_LEN;ind++)
    {
        m_vec[ind]=bn[ind-2];
    }
    m_vec[FFE_LEN]=temp_vecM[2];
    for(ind=FFE_LEN+1;ind<N;ind++)
    {
        m_vec[ind]=bn[ind-1];
    }

\[ \mu = [\,b_n(\mathrm{FFE\_LEN}-1{:}\mathrm{FFE\_LEN});\; b_n(N)\,]. \]

    mu[0]=bn[FFE_LEN-2];
    mu[1]=bn[FFE_LEN-1];
    mu[2]=bn[N-1];



\[ X_{fast} = [\,\mathrm{eps2}(1{:}2);\; X_{fast}(1{:}\mathrm{FFE\_LEN}-2);\; \mathrm{eps2}(3);\; X_{fast}(\mathrm{FFE\_LEN}+1{:}N-1)\,]. \]

    for(ind=FFE_LEN-1;ind>1;ind--)
    {
        X_fast[ind]=X_fast[ind-2];
    }
    X_fast[1]=eps2[1];
    X_fast[0]=eps2[0];
    for(ind=FBE_LEN-1;ind>0;ind--)
    {
        X_fast[ind+FFE_LEN]=X_fast[FFE_LEN+ind-1];
    }
    X_fast[FFE_LEN]=eps2[2];

\[ \eta = P3 + D_{fast}^{T} X_{fast}. \]

    eta[0]=p3[0];
    eta[1]=p3[1];
    eta[2]=p3[2];
    for(ind=0;ind<N;ind++)
    {
        eta[0]+=mpy_n(D_fast[ind],X_fast[ind]);
        eta[1]+=mpy_n(D_fast[ind+N],X_fast[ind]);
        eta[2]+=mpy_n(D_fast[ind+N_2],X_fast[ind]);
    }

\[ K_{fast} = m - D_{fast}\, \mu. \]

We use the equivalent computation:

\[
\begin{aligned}
K_{fast} &= \frac{m - D_{fast}\, \mu}{1 - \eta^{T} \mu}, \\
D_{fast} &= D_{fast} - K_{fast}\, \eta^{T}.
\end{aligned}
\]

    temp_fast=inp(eta,mu,M);
    temp_fast=div_wor(FIXED_ONE-temp_fast);
    for(ind=0;ind<N;ind++)
    {
        ind2=ind+N;
        ind3=ind+N_2;
        K_fast[ind]=m_vec[ind];
        K_fast[ind]-=mpy_n(D_fast[ind],mu[0]);
        K_fast[ind]-=mpy_n(D_fast[ind2],mu[1]);
        K_fast[ind]-=mpy_n(D_fast[ind3],mu[2]);
        K_fast[ind]=mpy_n(K_fast[ind],temp_fast);
    }
    for(ind=0;ind<N;ind++)
    {
        ind2=ind+N;
        ind3=ind+N_2;
        D_fast[ind]-=mpy_n(K_fast[ind],eta[0]);
        D_fast[ind2]-=mpy_n(K_fast[ind],eta[1]);
        D_fast[ind3]-=mpy_n(K_fast[ind],eta[2]);
    }

\[ x_{est} = C_{fast}^{T} X_{fast}. \]

    xest=inp(X_fast,C_fast,N);

\[ \mathrm{error} = \mathrm{decision} - x_{est}. \]

    error_xest=decision-xest;

\[ C_{fast} = C_{fast} + \mathrm{error} \cdot K_{fast}. \]

    if(((gdat.fastbuff_offset)*2)>FFE_LEN)
    {
        for(ind=0;ind<N;ind++)
        {
            C_fast[ind]+=mpy_n(K_fast[ind],error_xest);
        }
    }

APPENDIX B

Step-by-step derivation of the RLC-fast algorithm from the fast algorithm

Step 1 Initialization:

\[
\begin{aligned}
A_{fast} &= 0, && (N \times 3) \\
D_{fast} &= 0, && (N \times 3) \\
F_{fast} &= \delta_1 I, && (3 \times 3) \\
K_{fast} &= 0, && (N \times 1) \\
X_{fast} &= 0, && (N \times 1) \\
C_{fast} &= 0, && (N \times 1) \\
\mathrm{decision} &= 0, && (1 \times 1)
\end{aligned}
\]

The only difference is in the definition of $F_{fast}$. Instead of defining the variable $E_{fast}$, we
define its inverse, i.e., $F_{fast} = E_{fast}^{-1}$. Hence, $\delta_1 = 1/\delta$.

Step 2 Data Acquisition:

\[
\begin{aligned}
P3 &= [\,X_{fast}(\mathrm{FFE\_LEN}-2)\;\; X_{fast}(\mathrm{FFE\_LEN}-1)\;\; X_{fast}(N-1)\,], \\
\mathrm{eps2} &= [\,y(2);\; y(1);\; \mathrm{decision}\,], \\
\mathrm{decision} &= \mathrm{tx\_signal}.
\end{aligned}
\]

This phase is identical to the data acquisition phase of the fast algorithm.

Step 3 Update Computation:

The first two equations are identical:

\[
\begin{aligned}
e &= \mathrm{eps2} + A_{fast}^{T} X_{fast}, \\
A_{fast} &= A_{fast} - K_{fast}\, e^{T}.
\end{aligned}
\]

Fast algorithm:

\[ e_p = \mathrm{eps2} + A_{fast}^{T} X_{fast}. \]

Using the last relation for $A_{fast}$, we write

\[
\begin{aligned}
A_{fast}^{T} X_{fast} &= \left[A_{fast} - K_{fast}\, e^{T}\right]^{T} X_{fast}, \\
&= A_{fast}^{T} X_{fast} - e\, K_{fast}^{T} X_{fast}, \\
&= e - \mathrm{eps2} - e\, K_{fast}^{T} X_{fast},
\end{aligned}
\]

where in the last equation we used the definition of $e$. Substituting this expression for
$A_{fast}^{T} X_{fast}$ back into the equation for $e_p$, we get an equivalent expression for $e_p$ as

\[ e_p = e\,(1 - K_{fast}^{T} X_{fast}). \]

Motivation: Reduction in computation from $mN$ to $N + m$.
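Fast algorithm:

\[
\begin{aligned}
E_{fast} &= \lambda E_{fast} + e_p\, e^{T}, \\
b_n &= K_{fast} + A_{fast}\, E_{fast}^{-1}\, e_p.
\end{aligned}
\]

First notice that $b_n = K_{fast} + A_{fast}\, c_n$, where

\[ c_n = E_{fast}^{-1}\, e_p. \]

Now,

\[
\begin{aligned}
c_n &= E_{fast}^{-1}\, e_p, \\
&= \left(\lambda E_{fast} + e_p\, e^{T}\right)^{-1} e_p, \\
&= \left[\lambda_1 E_{fast}^{-1} - \frac{\lambda_1 E_{fast}^{-1}\, e_p\, e^{T} \lambda_1 E_{fast}^{-1}}{1 + e^{T} \lambda_1 E_{fast}^{-1}\, e_p}\right] e_p,
\end{aligned}
\]

where in the last line we used the matrix inversion lemma and $\lambda_1 = 1/\lambda$. Substituting
$F_{fast} = E_{fast}^{-1}$ and rearranging, we get

\[ c_n = \frac{\lambda_1 F_{fast}\, e_p}{1 + e^{T} \lambda_1 F_{fast}\, e_p}. \]

After resequencing, we obtain

\[
\begin{aligned}
F_{fast} &= \lambda_1 F_{fast}, \\
c_n &= \frac{F_{fast}\, e_p}{1 + e^{T} F_{fast}\, e_p}, \\
F_{fast} &= F_{fast} - c_n\, e^{T} F_{fast}, \\
b_n &= K_{fast} + A_{fast}\, c_n.
\end{aligned}
\]

Motivation: Reduction in computation from $mN + 4m^2$ multiplications and an $(m \times m)$
matrix inversion to $mN + 4m^2 + 2m$ multiplications and a scalar division. This also improves
the stability of the algorithm.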
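
The next four equations are identical:

\[
\begin{aligned}
m &= [\,c_n(1{:}2);\; b_n(1{:}\mathrm{FFE\_LEN}-2);\; c_n(3);\; b_n(\mathrm{FFE\_LEN}+1{:}N-1)\,], \\
\mu &= [\,b_n(\mathrm{FFE\_LEN}-1{:}\mathrm{FFE\_LEN});\; b_n(N)\,], \\
X_{fast} &= [\,\mathrm{eps2}(1{:}2);\; X_{fast}(1{:}\mathrm{FFE\_LEN}-2);\; \mathrm{eps2}(3);\; X_{fast}(\mathrm{FFE\_LEN}+1{:}N-1)\,], \\
\eta &= P3 + D_{fast}^{T} X_{fast}.
\end{aligned}
\]

Fast algorithm:

\[
\begin{aligned}
D_{fast} &= (D_{fast} - m\, \eta^{T})(I - \mu\, \eta^{T})^{-1}, \\
K_{fast} &= m - D_{fast}\, \mu.
\end{aligned}
\]

Substituting the first relation for $D_{fast}$ into the second equation and using the matrix inversion
lemma, we get

\[
\begin{aligned}
K_{fast} &= m - (D_{fast} - m\, \eta^{T})(I - \mu\, \eta^{T})^{-1} \mu, \\
&= m - (D_{fast} - m\, \eta^{T})\left[I + \frac{\mu\, \eta^{T}}{1 - \eta^{T} \mu}\right] \mu, \\
&= m - (D_{fast} - m\, \eta^{T})\left[\mu + \frac{\mu\, \eta^{T} \mu}{1 - \eta^{T} \mu}\right], \\
&= m - \frac{(D_{fast} - m\, \eta^{T})\, \mu}{1 - \eta^{T} \mu}, \\
&= \frac{m - D_{fast}\, \mu}{1 - \eta^{T} \mu}.
\end{aligned}
\]

Now,

\[
\begin{aligned}
D_{fast} &= (D_{fast} - m\, \eta^{T})(I - \mu\, \eta^{T})^{-1}, \\
&= (D_{fast} - m\, \eta^{T})\left(I + \frac{\mu\, \eta^{T}}{1 - \eta^{T} \mu}\right), \\
&= D_{fast} + \frac{D_{fast}\, \mu\, \eta^{T}}{1 - \eta^{T} \mu} - m\, \eta^{T} - \frac{m\, \eta^{T} \mu\, \eta^{T}}{1 - \eta^{T} \mu}, \\
&= D_{fast} + \frac{D_{fast}\, \mu\, \eta^{T}}{1 - \eta^{T} \mu} - \frac{m\, \eta^{T}}{1 - \eta^{T} \mu}, \\
&= D_{fast} - \frac{(m - D_{fast}\, \mu)\, \eta^{T}}{1 - \eta^{T} \mu}, \\
&= D_{fast} - K_{fast}\, \eta^{T},
\end{aligned}
\]

where in the last step we substituted the expression for $K_{fast}$. Hence,

\[
\begin{aligned}
K_{fast} &= \frac{m - D_{fast}\, \mu}{1 - \eta^{T} \mu}, \\
D_{fast} &= D_{fast} - K_{fast}\, \eta^{T}.
\end{aligned}
\]

Motivation: Reduction in computation from $(m + 2)mN + m^2$ multiplications and an $(m \times m)$
matrix inversion to $(2m + 1)N + m$ multiplications and a scalar division.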

The last three equations are identical:

\[
\begin{aligned}
x_{est} &= C_{fast}^{T} X_{fast}, \\
\mathrm{error} &= \mathrm{decision} - x_{est}, \\
C_{fast} &= C_{fast} + \mathrm{error}\, K_{fast}.
\end{aligned}
\]